Data Science Project Boilerplate

Every time I start a new data science project, I go through the same setup steps. Create folders, set up the virtual environment, add a .gitignore, write the Dockerfile. It takes time, and worse, when I don’t follow a consistent structure I end up with projects that are hard to navigate six months later.

The structure below is what I use. It’s not revolutionary (it’s opinionated in ways that work for me), but having a standard starting point saves time and keeps things consistent across projects. Here it is.

The folder structure

.
├── data
│   └── data.csv
├── Dockerfile
├── Makefile
├── .env
├── .envrc
├── .gitignore
├── notebooks
│   └── datascientist_deliverable.ipynb
├── README.md
├── requirements.txt
├── scripts
│   └── script.py
├── setup.py
└── thepkg
    ├── __init__.py
    ├── interface
    │   └── __init__.py
    ├── ml_logic
    │   ├── data.py
    │   ├── __init__.py
    │   ├── model.py
    │   └── preprocessor.py
    ├── params.py
    └── utils.py

Here’s what each piece does and why I include it:

  • data: The project’s data files. Keep raw data here and don’t overwrite it; treat it as read-only once ingested.

  • Dockerfile: Defines the container environment. I develop inside Docker containers (see my post on Docker + bind mounts), so this is a key part of every project, not optional scaffolding.
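
As a rough sketch, a Dockerfile for this layout might look like the following. The base image, Python version, and working directory are assumptions for illustration, not a prescription:

```dockerfile
# Minimal sketch; base image and paths are illustrative assumptions.
FROM python:3.11-slim

WORKDIR /app

# Install pinned dependencies first so this layer is cached
# until requirements.txt actually changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the project and install thepkg in editable mode
COPY . .
RUN pip install -e .

CMD ["bash"]
```

With bind mounts for development, the COPY of the source is mostly a fallback; the mounted directory shadows it at run time.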

  • Makefile: Automates common tasks: building the image, running tests, launching the notebook server. A well-written Makefile means you don’t have to remember long docker run commands.
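
A minimal Makefile in that spirit might look like this. The target names and image name are placeholders, and recipe lines must be indented with tabs:

```make
IMAGE := my_data_science_project

.PHONY: build notebook test

build:
	docker build -t $(IMAGE) .

notebook:
	docker run --rm -it -p 8888:8888 -v $(PWD):/app $(IMAGE) \
		jupyter notebook --ip=0.0.0.0 --no-browser --allow-root

test:
	docker run --rm $(IMAGE) pytest
```

Then `make build` and `make notebook` replace the long docker commands you would otherwise have to remember.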

  • .env: Environment variables (API keys, database connection strings, anything that shouldn’t be hardcoded). Never commit this file.

  • .envrc: Works with direnv to automatically load the environment when you enter the project directory. Makes switching between projects less error-prone.
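
For illustration, a .env with placeholder values and an .envrc that loads it through direnv’s built-in dotenv helper might look like this:

```shell
# .env (never committed; values are placeholders)
API_KEY="replace-me"
DATABASE_URL="postgres://localhost:5432/mydb"

# .envrc (direnv evaluates this when you cd into the project;
# `dotenv` is a direnv standard-library function that loads .env)
dotenv
```

Run `direnv allow` once after creating or editing .envrc, and the variables load and unload automatically as you enter and leave the directory.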

  • .gitignore: Keeps the repository clean. At minimum: .env, __pycache__, .ipynb_checkpoints, and whatever data files are too large to version.
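
A starting .gitignore along those lines (the data pattern is an example; adjust it to whatever is too large to version):

```
.env
__pycache__/
.ipynb_checkpoints/
*.egg-info/
data/*.csv
```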

  • notebooks: For exploration and deliverables. One notebook per logical question. Don’t use notebooks as a substitute for proper code; anything reusable goes into thepkg.

  • README.md: How to set up and run the project. Assume the reader has never seen your code before, because in three months that reader will be you.

  • requirements.txt: Pinned package versions. Reproducibility starts here.
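
Pinned means exact versions, not ranges. The packages and version numbers below are purely illustrative:

```
pandas==2.2.2
scikit-learn==1.5.0
jupyter==1.0.0
```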

  • scripts: Standalone scripts: data ingestion, batch jobs, one-off transformations. These call into thepkg rather than containing business logic directly.

  • setup.py: Lets you install thepkg as a local editable package (pip install -e .). Once it’s installed, you can import it cleanly from notebooks and scripts without path hacks.
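
A minimal setup.py for this layout could be as small as the following; the name and version are placeholders:

```python
# Minimal setup.py sketch; metadata values are placeholders.
from setuptools import setup, find_packages

setup(
    name="thepkg",
    version="0.1.0",
    packages=find_packages(),  # discovers thepkg and its subpackages
)
```

After pip install -e ., an import like from thepkg.ml_logic import model works from any notebook or script, and edits to the source take effect without reinstalling.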

  • thepkg: The actual Python package where reusable code lives. Organized into:

    • interface: Entry points and API definitions.
    • ml_logic: Data loading (data.py), preprocessing (preprocessor.py), and model code (model.py). Keeping these separate makes it easier to swap out one component without touching the others.
    • params.py: Configuration constants (model hyperparameters, file paths, column names). Centralizing these means you change things in one place.
    • utils.py: Utility functions that don’t belong anywhere else.
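
To make the separation concrete, here is a hypothetical sketch of how the pieces fit together in a single file. The function names mirror the layout (params, data.py, preprocessor.py, model.py), but the bodies are illustrative stubs, not real training code:

```python
# Hypothetical sketch of the thepkg layout; names mirror the files,
# bodies are stand-in stubs for illustration.

# params.py: configuration constants live in one place
TARGET_COLUMN = "price"  # assumed column name

# ml_logic/data.py: loading
def load_data(rows):
    """Stand-in for reading data/data.csv; returns a list of dicts."""
    return rows

# ml_logic/preprocessor.py: turn raw rows into features and target
def preprocess(rows, target=TARGET_COLUMN):
    X = [[v for k, v in r.items() if k != target] for r in rows]
    y = [r[target] for r in rows]
    return X, y

# ml_logic/model.py: a trivial "model" that predicts the mean of y
def train(X, y):
    mean = sum(y) / len(y)
    return lambda x: mean

# interface/__init__.py would wire these together as an entry point:
rows = load_data([{"size": 50, "price": 100}, {"size": 80, "price": 200}])
X, y = preprocess(rows)
model = train(X, y)
print(model([60]))  # 150.0
```

Because data loading, preprocessing, and modeling only touch each other through plain function calls, swapping the stub model for a real one doesn’t disturb the other two files.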

The folder structure is a template, not a contract. Some projects need more; some need less. But starting here is faster than starting from scratch, and it’s easier to remove structure you don’t need than to add it retroactively.

Below is a bash script to create the entire structure in one shot.

Code
#!/bin/bash
set -euo pipefail  # abort on the first failing command

# Create the main project directory
mkdir -p my_data_science_project

# Create subdirectories and files
cd my_data_science_project

mkdir -p data
touch data/data.csv

touch Dockerfile
touch Makefile
touch .env
touch .envrc
touch .gitignore

mkdir -p notebooks
touch notebooks/datascientist_deliverable.ipynb

touch README.md
touch requirements.txt

mkdir -p scripts
touch scripts/script.py

touch setup.py

mkdir -p thepkg
touch thepkg/__init__.py

mkdir -p thepkg/interface
touch thepkg/interface/__init__.py

mkdir -p thepkg/ml_logic
touch thepkg/ml_logic/data.py
touch thepkg/ml_logic/__init__.py
touch thepkg/ml_logic/model.py
touch thepkg/ml_logic/preprocessor.py

touch thepkg/params.py
touch thepkg/utils.py